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Abstract 

It has been shown that minimum free energy structure for RNAs and RNA-RNA interaction 
is often incorrect due to inaccuracies in the energy parameters and inherent limitations of the en- 
ergy model. In contrast, ensemble based quantities such as melting temperature and equilibrium 
concentrations can be more reliably predicted. Even structure prediction by sampling from the 
ensemble and clustering those structures by Sfold [7] has proven to be more reliable than mini- 
mum free energy structure prediction. The main obstacle for ensemble based approaches is the 
computational complexity of the partition function and base pairing probabilities. For instance, 
the space complexity of the partition function for RNA-RNA interaction is 0(n 4 ) and the time 
complexity is 0(n 6 ) which are prohibitively large [H [JJ]- Our goal in this paper is to give a 
fast algorithm, based on sparse folding, to calculate an upper bound on the partition function. 
Our work is based on the recent algorithm of Hazan and Jaakkola [TO] . The space complexity 
of our algorithm is the same as that of sparse folding algorithms, and the time complexity of 
our algorithm is 0(MFE(n)£) for single RNA and 0(MFE(m,n)£) for RNA-RNA interaction 
in practice, in which MFE is the running time of sparse folding and £ < n (£ < n + m) is a 
sequence dependent parameter. 



1 Introduction 

Since the turn of the millennium and the advent of high throughput biology in the post-genome 
era, startling discoveries have redefined the role of RNA as a key player in the cellular arena. 
The ribosome and spliceosome are essentially two major RNA machines that together with other 
structural RNAs such as microRNAs (miRNA), long intergenic non-coding RNAs (lincRNA), small 
bacterial non-coding RNAs, and many other categories of structural RNAs run the cell at an extent 
that is comparable to that of protein machinery. For instance, lincRNAs have been recently shown 
to play sophisticated regulatory roles in mammalian cells, and miRNAs play a significant role in 



1 



the development of cancer. These discoveries have put RNA together with proteins in the center of 
focus for research and therapeutic purposes, including personalized medicine. Humanity has just 
begun to unravel RNA's complicated roles in living cells, and RNA is no longer considered a mere 
information medium from DNA to proteins, but it is rather in the center of attention in molecular 
and cellular biology research. RNA molecules often function through interaction with other RNAs. 
In the absence of high throughput experimental assays to observe RNA structure and RNA-RNA 
interactions, the problems of RNA structure prediction and RNA-RNA interaction prediction gain 
the highest priority in bioinformatics. 

RNA structure and RNA-RNA interaction prediction have recently received significant atten- 
tion. The majority of algorithms that have been developed predict the minimum free energy 
structure [1] or binding sites |21j . However, it is a well-known fact that minimum free energy 
structure is often incorrect due to inaccuracies in the energy parameters and inherent limitations of 
the energy model. On the other hand, it has been shown that thermodynamic quantities, such as 
melting temperature and equilibrium concentrations, that are derived from the partition function 
which captures the properties of the whole Boltzmann ensemble rather than those of the most 
likely structure, can be more reliably predicted [3] . Even structure prediction by sampling from the 
ensemble and clustering those structures by Sfold [7] has proven to be more reliable than minimum 
free energy structure prediction [51 [6] . 

The main obstacle for ensemble based approaches is the computational complexity of the par- 
tition function and base pairing probabilities. For instance, the space complexity of computing 
the partition function for RNA-RNA interaction is 0(n 4 ) and the time complexity is 0(n 6 ) which 
are prohibitively large [U [12]. On the other hand, recent progress in sparse folding algorithms 
has provided fast algorithms for the prediction of the most likely (minimum free energy) structure 
[2| 120} [23]. Although the partition function cannot be calculated exactly using sparsification ideas, 
it may be approximated. Our goal in this paper is to give a fast algorithm, based on sparse folding, 
to calculate an upper bound on the partition function. Our work is based on the recent algorithm 
of Hazan and Jaakkola |10| . 

2 Related Work 

Methods to approximate the partition function for interacting RNAs have not been proposed in the 
literature. Instead, methods for exact comutation of the partition function have been developed, 
having high both time and space complexity. Most notably, [2) developed an 0(n 6 )-time and 
0(n 4 )-space dynamic programming algorithm that computes the partition function of RNA-RNA 
interaction complexes, thereby providing detailed insights into their thermodynamic properties. 
|13| has developed a sampling algorithm that produces a Boltzmann weighted ensemble of RNA- 
RNA interaction structures for the calculation of interaction probabilities (and not the partition 
function) for any given interval on the target RNAs. 

In the context of single RNA secondary structure prediction, [16] devised a Metropolis Monte 
Carlo algorithm, called "Wang and Landau" algorithm [27], to approximate the partition func- 
tion as well as density of states. Although the computation of the partition function over all 
secondary structures and over all pseudoknot-free hybridizations can be done by efficient dynamic 
programming algorithms, the real advantage of [16] is in approximating the partition function where 
pseudoknotted structures are allowed; a context known to be NP-complete [17] . 

In the machine learning community, there has been extensive research on obtaining non- 
deterministic and deterministic approximations of the log-partition function. Firstly, sampling 
and Monte Carlo methods (e.g. Gibbs Sampling and Monte Carlo Markov Chain) have been devel- 
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oped as non-deterministic approaches for estimating the partition function (cf. [15] and references 
therein). In high dimensions, obtaining independent samples from a given distribution is difficult 
since the mixing time is typically exponential in the size of the problem. Therefore, these methods 
are computationally very demanding, and in practice, they are rarely applied to large-scale prob- 
lems. Secondly, variational techniques have been extensively developed as a deterministic approach 
to efficiently estimate the partition function in large-scale problems. In this approach, a simpler 
distribution is optimized as an approximation to the true distribution in a KL-divergence sense. 
However, the difficulty of this approach comes from: (i) Non-convexity of the set of feasible distri- 
butions (e.g. in "mean field" approximation [H]), and/or (ii) Hardness of computing the entropy 
embedded in the KL objective. Variational upper bounds on the other hand are convex, usually 
derived by replacing the entropy term in the KL objective with a simpler surrogate function and 
relaxing constraints on sufficient statistics hence convexifying the set of feasible distributions [26J . 

The basis of our work is [10J which provides a framework for approximating and bounding the 
partition function using MAF^J inference (i.e. prediction of the most likely structure) on randomly 
perturbed models. Particularly, they propose to estimate the partition function as the max-statistics 
of collections of random variables, which is a major topic in extereme value statistics (e.g. see [3]). 
More broadly, there is an emerging body of work on perturbation methods, showing the benefits of 
explicitly adding noise into the modeling, learning, and inference pipelines \22\ [25] . 

3 Preliminaries 

3.1 Notation 

The input nucleic acid sequences are denoted by R and S throughout this paper. Function L 
denotes the length of the input sequence, and R is indexed from 1 to L(R), and S is indexed from 1 
to L(S) both in 5' to 3' direction. We refer to the i th nucleotide in R and S by i^ and is respectively. 
The subsequence from the i th nucleotide to the j th nucleotide in a strand is denoted by [i,j]. An 
intramolecular base pair between the nucleotides i and j in a strand is called an arc and denoted 
by a bullet % • j. An intermolecular base pair between the nucleotides in and is is called a bond 
and denoted by a circle in o ig. We denote an RNA (RNA-RNA interaction) secondary structure 
by s, which is mathematically a set of constituent base pairs (arcs and bonds). The collection of 
all such feasible structures is denoted by 6. 

Throughout this paper, we denote the partition function by 

Q:=£ e_W ' (1) 

see 

in which G(s) is the free energy of s, R is the gas constant, and T is temperature. For a more 
detailed presentation of the partition function see [19[ 0]. We use the Turner energy model for 
single RNAs [18] and our energy model for RNA-RNA interaction [3]. 

In this paper, we consider only the canonical base pairing system, i.e. each nucleotide is Watson- 
Crick paired with at most one nucleotide. We also assume there are no pseudoknots in individual 
secondary structures of R and S, and there are no crossing bonds and zigzags between R and S [4]. 
However, an extension of our ideas to non-canonical base pairing systems |11] and pseudoknotted 
(crossing and zigzagged) structures [24] is straight forward. The key requirement for such an 
extension is the existence of a fast minimum free energy structure prediction algorithm that can 
incorporate per base-pair energy contributions for the considered class of structures. 

1 Maximum a posteriori 
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3.2 Energy Perturbations 

Following the Hazan-Jaakkola's approach [ID], let {7^.^}, {li R ,j R },{li s »js}i {7i s .j s }, {7i fl oi s }, 
and {7j fl0 j s } be six families of independent and identically distributed (i.i.d.) random variables, 
which are energy perturbations corresponding to presence and absence of base pairs, following the 
Gumbel distribution whose cumulative distribution function is 



F(x) := P[7 < x] = e 



-(.x + C) 



(2) 



Above, C is the Euler's constant, so that the mean of our Gumbel distribution defined in ([2]) is 
zero. For every structure s € &, let the energy perturbation of a structure be 



7(s) = 1*3 + Y 7i »i 



if s is single RNA structure and 

7(a) = 



Y 7iB»iH+ S ^s-w+ Z Tiff-. 

iR'jR& iR'jRgs is'JS^S 



.IS 



+ 2 TiflOi s + 



1 



if s is RNA-RNA interaction structure. 



(3) 



(4) 



4 Upper Bound on the Partition Function 

Corollary 1 in [IDJ states that 



log Q < £L 



max{-G(s)/ J RT + 7 (s)} 

SS6 



-S-v min{G(s)/ J RT-7(s)} 
sG6 



(5) 



in which £7 is the expectation with respect to 7's. The perturbations 7 include terms that depend 
on base pairs (7.) and terms that depend on lack of base pairs (7.). To simplify the incorporation 
of 7 terms into the energy model, let 



H*)= Y ^'3 -limj) 



(6) 



for single RNA structure and 



+ Y (7* R°is 7i H oi s J 



(7) 



for RNA-RNA interaction structure. Since the difference between two random variables following 
the Gumbel distribution follows the logistic distribution, we can rewrite the single RNA A(s) as 



A ( s ) = Y Xi 'i 



(8) 
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and the RNA-RNA interaction one as 



A(s) — ^iR»jn + ^is»js + ^iflois' (^) 

«S»is6s «flOi s gs 

where A's are independent identically distributed random variables following the logistic distribu- 
tion. In that case, ([3]) and ([U]) imply 

7(s) = ^7 J . i + A(s) (10) 

and (jl]) and ([7]) imply 

7(5) = iiR°is + E + E ^s-is + A ( s )- ( n ) 

i R oi s iR'ja is'is 

Since X)j«j7t«j an d Sioj7«oj are constants inside the minimization in (j^j), whose expectations are 
zero because of ([2~]). we can rewrite © as follows where A follows the logistic distribution: 



logQ< -E x 



mm{G{s)/RT - A(s)} 

se6 



(12) 



Our algorithm computes the right hand side of (|12p and calculates the upper bound = 
exp(— E\ [min se g{G(s)/i?T — A(s)}]). The minimization inside the expectation is essentially min- 
imum free energy prediction, albeit with a perturbed energy. Recall that the energy perturbation 
— A(s) is the sum of individual base-pair perturbations for all base-pairs in 5; therefore, incorpora- 
tion of such a perturbation in fast minimum free energy prediction algorithms, such as [2] which 
exploits sparsity, is straight forward. Particularly, we only need to add to the calculation of 

L c (i,j) in [2] when i and j can form a base-pair. Additionally, the scaling of energy by RT in (|12p 
has to be carefully applied to the sparse algorithm. Similarly, we only need to add —\i RO i s to the 
calculation of hybrid components in [23], in addition to proper handling of — Xi R »j R and — Xi s »j s in 
intramolecular base-pairings. 



4.1 Complexity Analysis 

Note that the perturbed energy is such that the triangle inequality in Property 1 of [2] still holds. 
Therefore, the running time of all sparse folding algorithms based on the triangle inequality [2j 
[20} [23] is not affected by our energy perturbation. To calculate the expectation above, we sample 
A's independently from the logistic distribution until the estimation of the expectation by simple 
averaging converges. Our experimental results show that the number of samples needed is much 
lower than the size of the input sequence. 

For a single RNA with length n, the time complexity of our upper bound algorithm is 0([n 2 + 
MFE(n)]£) in which I < n is the number of samples needed for the expectation estimation to 
converge, and MFE(n) is the running time of minimum free energy prediction. In this case, 
the space complexity is 0(n 2 + MFES(n)), in which MFES{n) is the memory space needed for 
minimum free energy prediction. 

For RNA-RNA interaction with lengths m and n, the time complexity of our algorithm is 
O ([m 2 + n 2 + MFE(m,n)^ t\ in which I < n + m is the number of samples needed for the ex- 
pectation estimation to converge, and MFE(m, n) is the running time of minimum free energy 
prediction. In this case, the space complexity is 0(m 2 + n 2 + MFES(m, n)), in which MFES(m, n) 
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is the memory space needed for minimum free energy prediction. Usually the running time and 
space complexity of our upper bound are dominated by those of minimum free energy prediction; 
therefore in practice, the time complexity of our upper bound is 0(MFE(n)£) for single RNA and 
0(MFE(m, n)£) for RNA-RNA interaction, and its space complexity is often the same as that of 
minimum free energy prediction. 

5 Results 

We implemented the upper bound algorithm in our piRNA package [1]. To test the overestimation 
of partition function and also the number of samples needed for the algorithm to converge, we 
randomly selected totally 273,512 RNA sequences from Rfam 11.0 database [8j [9] and computed 
the upper bounds for them. Since the partition functions are often large numbers, we report the 
ensemble energies. The ensemble energy is proportional to the minus log of the partition function 
and an overestimation of the partition function becomes an underestimation of the ensemble energy. 
Fig. Q] depicts the results of our experiment. The top plot shows the histogram of ensemble energy 
underestimation calculated by —RT log Q + RT log Q u {,. The middle plot shows the histogram of 
ensemble energy underestimation percentage calculated from the ratio log Q u b/ log Q — 1. This 
plot shows that for a vast majority of cases this ratio is below 40%. The bottom plot depicts the 
distribution of ensemble energy —RT log Q in the input dataset. Although this distribution exhibits 
multiple peaks, the middle distribution, which is the underestimation percentage, has a unimodal 
behaviour. Out of 273,512 RNAs, for 249,622 of them the underestimation is less than 50% of their 
ensemble energy, and for about half of the sequences (148,762) this difference is smaller than 30%. 
The number of sequences for which this difference is negative is 2,356 which is less than 1% of the 
total number of RNAs. 

Fig. [2] shows the performance of our algorithm. The top plot is the histogram of the number 
of samples i (iterations) needed for the algorithm to converge. The middle and bottom plots show 
the histogram of the number of iterations per input size and per the log of input size. The vast 
majority of sequences required less than 15% of their length iterations to converge. The number 
of iterations starts with 7 and almost all the sequences need less than 40 samples. 243,119 or 89% 
of RNAs need not more than 20 iterations, and for more than half of the sequences (153,515) the 
experiment has been done with less than 15 iterations. Therefore at most 40 iterations are enough 
for different Rfam RNAs with different lengths. The number of iterations per length ratio for most 
of the sequences is less than 30%. For 90% of the RNAs in this dataset, this number is between 
7% and 25%. Clearly the relation between length and the number of iterations is not linear, and 
upper bounds for different RNAs with the same length require different number of iterations. 

Recall that the space complexity of our algorithm is the same as that of sparse minimum free 
energy prediction. Therefore, our algorithm is both fast and space efficient in practice. 

6 Conclusion 

We gave a fast algorithm, based on the recent algorithm of Hazan and Jaakkola |10j . to iteratively 
compute an upper bound on the partition function of nucleic acids by perturbing energy. Our upper 
bound algorithm uses a fast minimum free energy prediction in each iteration. Our algorithm 
preserves the properties on which sparsification methods rely; therefore, we minimally modified 
sparse folding algorithms [21 [20], [23] to obtain the required fast minimum free energy prediction. 

For the lower bound, one can trivially use the single term corresponding to the minimum 
free energy. The lower bound algorithm of Hazan and Jaakkola [10] requires modification to be 
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Figure 1: Histograms of (top) ensemble energy underestimation —RT log Q + RT log for 
all 273,512 sequences in our dataset, (middle) percentage of ensemble energy underestimation 
^ogQ u b/logQ — 1, and (bottom) ensemble energy —RTlogQ in the input dataset. 
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Figure 2: Histograms of (top) number of iterations t needed for all sequences in our dataset, 
(middle) number of iterations as a percentage of the input sequence length, and (bottom) number 
of iterations per the log of the input sequence length. 



applicable to our problem. We leave such more accurate lower bound algorithms for future work. 
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